Skip to content

[CORE] Make LeafTransformSupport's getPartitions return Seq[Partition]#10838

Merged
zhztheplayer merged 1 commit intoapache:mainfrom
Zouxxyy:dev/fix-scan
Oct 16, 2025
Merged

[CORE] Make LeafTransformSupport's getPartitions return Seq[Partition]#10838
zhztheplayer merged 1 commit intoapache:mainfrom
Zouxxyy:dev/fix-scan

Conversation

@Zouxxyy
Copy link
Contributor

@Zouxxyy Zouxxyy commented Oct 3, 2025

What changes are proposed in this pull request?

InputPartition is actually just a DSv2 interface API, Partition is common.
LeafTransformSupport's getPartitions should return Seq[Partition], for example:

FileSourceScanExecTransformer return Seq[FilePartition]
BatchScanExecTransformer return Seq[DataSourceRDDPartition]

Especially after Spark proposed Storage-Partitioned Join, a DataSourceRDDPartition may contain multi InputPartitions. See DataSourceRDD:

class DataSourceRDD(
    sc: SparkContext,
    @transient private val inputPartitions: Seq[Seq[InputPartition]],
    partitionReaderFactory: PartitionReaderFactory,
    columnarReads: Boolean,
    customMetrics: Map[String, SQLMetric])
  extends RDD[InternalRow](sc, Nil)

  override protected def getPartitions: Array[Partition] = {
    inputPartitions.zipWithIndex.map {
      case (inputPartitions, index) => new DataSourceRDDPartition(index, inputPartitions)
    }.toArray
  }
}

Due to this issue, some if-else codes were introduced previously, this PR cleans them up.

How was this patch tested?

@github-actions github-actions bot added CORE works for Gluten Core VELOX DATA_LAKE labels Oct 3, 2025
@github-actions
Copy link

github-actions bot commented Oct 3, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 3, 2025

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link

github-actions bot commented Oct 3, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 3, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 3, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 3, 2025

Run Gluten Clickhouse CI on x86

2 similar comments
@github-actions
Copy link

github-actions bot commented Oct 3, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 5, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 5, 2025

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link

github-actions bot commented Oct 5, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 5, 2025

Run Gluten Clickhouse CI on x86

@Zouxxyy Zouxxyy marked this pull request as draft October 5, 2025 15:24
@github-actions
Copy link

github-actions bot commented Oct 5, 2025

Run Gluten Clickhouse CI on x86

@Zouxxyy Zouxxyy changed the title [CORE] Refactor BasicScanExecTransformer to make it accept Seq[Seq[InputPartition]] as InputPartitions [CORE] Make LeafTransformSupport's getPartitions return Seq[Partition] Oct 5, 2025
@github-actions
Copy link

github-actions bot commented Oct 5, 2025

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link

github-actions bot commented Oct 5, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 5, 2025

Run Gluten Clickhouse CI on x86

2 similar comments
@github-actions
Copy link

github-actions bot commented Oct 6, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 6, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 6, 2025

Run Gluten Clickhouse CI on x86

2 similar comments
@github-actions
Copy link

github-actions bot commented Oct 7, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 7, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 7, 2025

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link

github-actions bot commented Oct 7, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Oct 7, 2025

Run Gluten Clickhouse CI on x86

1 similar comment
@Zouxxyy
Copy link
Contributor Author

Zouxxyy commented Oct 7, 2025

Run Gluten Clickhouse CI on x86

@Zouxxyy Zouxxyy marked this pull request as ready for review October 8, 2025 13:35
@zhztheplayer
Copy link
Member

Thanks.

Should we also refactor GlutenWholeStageColumnarRDD? Does soft-affinity still work correctly after the change?

@Zouxxyy
Copy link
Contributor Author

Zouxxyy commented Oct 8, 2025

Should we also refactor GlutenWholeStageColumnarRDD?

For GlutenWholeStageColumnarRDD, I think these modifications below are sufficient. Do you think it needs to be modified in this PR as well? (Besieds. NativeFileScanColumnarRDD doesn't seem to be used. Should I delete it?)

case class FirstZippedPartitionsPartition(
    index: Int,
    inputPartition: Partition,
    inputColumnarRDDPartitions: Seq[Partition] = Seq.empty)
  extends Partition

class GlutenWholeStageColumnarRDD(
    @transient sc: SparkContext,
    @transient private val inputPartitions: Seq[Partition],

Does soft-affinity still work correctly after the change?

do you mean getPreferredLocations? I haven't modified its logic, so it still applies

e.g. in GlutenWholeStageColumnarRDD, It will do the correct cast.

  override def getPreferredLocations(split: Partition): Seq[String] = {
    castNativePartition(split)._1.preferredLocations()
  }

Or just like FileScanRDD in apache spark

  override protected def getPreferredLocations(split: RDDPartition): Seq[String] = {
    split.asInstanceOf[FilePartition].preferredLocations()
  }

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@zhztheplayer
Copy link
Member

zhztheplayer commented Oct 10, 2025

For GlutenWholeStageColumnarRDD, I think these modifications below are sufficient. Do you think it needs to be modified in this PR as well?

Yes and I saw you're working on this. Thanks.

Besieds. NativeFileScanColumnarRDD doesn't seem to be used. Should I delete it?

For sure. Thanks.

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

})

val allSplitInfos = getSplitInfosFromPartitions(isKeyGroupPartition, leafTransformers)
val allSplitInfos = leafTransformers.map(_.getSplitInfos).transpose
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to preserve a similar comment like this? Which will help user understand the .transpose usage here.

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@Zouxxyy
Copy link
Contributor Author

Zouxxyy commented Oct 16, 2025

@zhztheplayer I rebased and resolved the conflict, can you have a look, thanks

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the refactor. The code is much clearer now.

@zhztheplayer zhztheplayer merged commit 2daa733 into apache:main Oct 16, 2025
57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants